Abstract:Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.
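The amplify/suppress/preserve rule described for DACSP can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `modulate_update`, the gain values, and the assumption that the historical capability directions are supplied as orthonormal rows are all hypothetical choices made for illustration.

```python
import numpy as np

def modulate_update(update, directions, amplify=1.5, suppress=0.5):
    """Rescale the parts of `update` that lie along historical directions.

    `directions` has shape (k, d); its rows are ASSUMED to be orthonormal.
    `amplify` and `suppress` are hypothetical gains, not values from the paper.
    """
    D = np.asarray(directions, dtype=float)
    update = np.asarray(update, dtype=float)
    coeffs = D @ update                 # signed component along each direction
    orthogonal = update - D.T @ coeffs  # part outside the subspace, kept as-is
    # A non-negative coefficient is treated as aligned, a negative one as conflicting.
    scale = np.where(coeffs >= 0, amplify, suppress)
    return orthogonal + D.T @ (scale * coeffs)

# Usage: two historical directions in a 3-dimensional parameter space.
new_update = modulate_update([2.0, -4.0, 3.0], [[1, 0, 0], [0, 1, 0]])
```

In this toy case the first component is aligned and grows, the second conflicts and shrinks, and the third is orthogonal to both directions and passes through unchanged.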
Abstract:In semiconductor manufacturing, lithography projects circuit layouts onto silicon wafers through an optical mask. As circuit features shrink below the wavelength of light, optical diffraction causes the printed patterns to deviate from their intended layouts. Inverse Lithography Technology (ILT) addresses this challenge by generating optimized masks that enhance the fidelity of pattern transfer onto wafers. While ILT resembles an image synthesis task, its reliance on explicit physical metrics for mask evaluation limits the applicability of existing generative models. We introduce LithoGRPO, an ILT framework that integrates the flow-matching paradigm with GRPO-based reinforcement learning (RL) fine-tuning, enabling efficient exploration of diverse masks for a given target layout. Unlike purely generative or optimization-based approaches, RL in LithoGRPO exploits the explicitly defined, physics-based reward function of ILT, enabling optimization under complex, process-aware constraints. To the best of our knowledge, this is the first framework that unifies flow matching and RL for mask optimization. To improve RL sampling efficiency, we propose a fast shot-counting algorithm for manufacturability evaluation, achieving over 130x speedup while preserving the mask ranking of the traditional shot-count metric. Extensive experiments demonstrate that LithoGRPO achieves state-of-the-art performance over both optimization-based and learning-based methods, while maintaining efficient mask generation.
Abstract:Tactile sensing is essential for robots to achieve human-like gentle manipulation. However, existing Vision-Language-Action (VLA) models struggle to exploit tactile feedback for gentle manipulation due to scarce aligned vision-tactile-language data and the lack of effective closed-loop force feedback mechanisms. To address these challenges, we introduce Tabero, a benchmark and model suite for gentle, language-conditioned robotic manipulation that demands fine-grained contact force perception. First, the Tabero benchmark addresses the scarcity of tactile data by presenting a data-efficient pipeline that repurposes open-source robot manipulation trajectories to generate diverse vision-tactile-language tasks, and establishes a multidimensional evaluation protocol that measures task success alongside physical interaction quality. Second, we propose Tabero-VTLA, an architecture with a decoupled force-position command interface; the resulting force-position commands are executed by a fixed hybrid controller to enable real-time, force-aware manipulation. Evaluated on Tabero, our model maintains high task success while reducing average grip force by over 70\% under gentle instructions, demonstrating its ability to modulate interaction forces based on multimodal experience. Our code is publicly available at https://github.com/NathanWu7/Tabero.
Abstract:Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
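A minimal sketch of the idea behind learning an agent transition matrix online and using it to pick an eviction victim. The class and function names are hypothetical, and the eviction rule shown (drop the cached agent least likely to run next) is a simplified stand-in for the survival-based eviction the abstract mentions, not CacheSage itself.

```python
from collections import defaultdict

class TransitionModel:
    """Online estimate of P(next agent | current agent) from observed dispatches."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_agent, next_agent):
        # Record one observed hand-off between agents.
        self.counts[prev_agent][next_agent] += 1

    def predict(self, current_agent, candidate):
        # Empirical transition probability; 0.0 if nothing has been observed yet.
        row = self.counts.get(current_agent, {})
        total = sum(row.values())
        return row.get(candidate, 0) / total if total else 0.0

def choose_victim(model, current_agent, cached_agents):
    """Evict the cached agent with the lowest predicted chance of running next."""
    return min(cached_agents, key=lambda a: model.predict(current_agent, a))

# Usage with made-up agent names.
model = TransitionModel()
for nxt in ["coder", "coder", "coder", "reviewer"]:
    model.observe("planner", nxt)
victim = choose_victim(model, "planner", ["coder", "reviewer"])
```

The same `predict` scores could also drive prefetch by loading the most likely next agent's cache entries between steps, although that part is not shown here.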
Abstract:Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical iterative solvers often suffer from prohibitive runtime due to ill-conditioned systems. While neural operators offer a fast alternative, they typically rely on large-scale labeled datasets or struggle with unstable training dynamics when using physics-informed residual losses. We propose \textsc{NPSolver}, a neural Poisson solver trained without solution labels via iterative physics supervision. Instead of relying on fully converged numerical solutions or raw PDE residuals, \textsc{NPSolver} utilizes a small number of preconditioned conjugate gradient (PCG) steps to refine its own predictions, providing a more stable and well-scaled training signal. Theoretical analysis confirms that this iterative supervision serves as a well-conditioned error proxy and that a stop-gradient design is essential for optimization stability. To better capture boundary-driven features under mixed boundary conditions, we further introduce the Boundary-Aware Transolver (\textsc{BA-Transolver}) architecture that explicitly separates interior and boundary tokenization. Extensive evaluations on 2D and 3D irregular geometries demonstrate that \textsc{NPSolver} outperforms both physics-informed and data-driven baselines. Furthermore, a downstream thermal control task highlights the model's capability for conducting efficient and reliable gradient-based boundary control. We will release our code and data at https://github.com/intell-sci-comput/NPSolver.
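The iterative supervision idea can be sketched on a discretized linear system A u = b. The code below is an illustration under stated assumptions, not the NPSolver training code: it uses a Jacobi (diagonal) preconditioner, plain NumPy with no autodiff, and hypothetical function names. In a real training loop the refined target would be wrapped in a stop-gradient so that only the prediction receives gradients.

```python
import numpy as np

def pcg_refine(A, b, x0, steps=3):
    """Run a few Jacobi-preconditioned CG steps starting from the prediction x0."""
    m_inv = 1.0 / np.diag(A)          # Jacobi preconditioner (an assumption)
    x = np.array(x0, dtype=float)
    r = b - A @ x
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(steps):
        if rz < 1e-30:                # already converged; avoid dividing by ~0
            break
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

def iterative_supervision_loss(A, b, prediction, steps=3):
    """Mean squared distance between a prediction and its PCG-refined version."""
    target = pcg_refine(A, b, prediction, steps)  # treated as a constant target
    return float(np.mean((prediction - target) ** 2))

# Usage: 1D Poisson matrix with Dirichlet boundaries and a zero initial guess.
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
loss = iterative_supervision_loss(A, b, np.zeros(n))
```

The loss shrinks toward zero as the prediction approaches the true solution, because a converged prediction is left essentially unchanged by the PCG steps.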
Abstract:This paper investigates abstention learning for large language models (LLMs), specifically the use of a ternary reward, which incentivizes truthfulness in LLMs. This paper extends that idea by moving from a static ternary reward to Trajectory-Informed Advantage Reweighting (TIAR), which dynamically re-weights the abstention reward during Group Relative Policy Optimization (GRPO) training. The work focuses on abstention learning rather than on improving truthfulness, and serves as an exploration into hallucination reduction. Its novelty lies in methodological innovation, advantage re-weighting, and benchmark selection. Leveraging GRPO's multiple trajectories as a natural abstention signal, the method uses a reward signal to explore knowledge boundaries and encourage consistency. The paper demonstrates that the trajectories sampled for a query can serve as an indicator of the policy's confidence on that query, and then uses them to dynamically calculate the abstention advantage. AbstentionBench is used as the evaluation benchmark, as this work aims to contribute to the field of abstention learning. All datasets in the benchmark were evaluated with this method and various baselines. Empirical results demonstrate that TIAR achieves state-of-the-art abstention F1 scores across five of six evaluation categories, outperforming the static ternary baseline on 17 of 31 benchmark datasets while fully preserving baseline accuracy.
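One way to picture using a GRPO group as a confidence signal is the sketch below. The abstract does not give TIAR's actual formula, so everything here is an assumption made for illustration: the group's fraction of correct answers is used as the confidence proxy, the abstention reward is set to 1 - 2 * confidence, and advantages use the usual group mean/std normalization.

```python
import numpy as np

def group_advantages(outcomes, eps=1e-8):
    """Group-normalized advantages with a confidence-dependent abstention reward.

    `outcomes` holds one label per sampled trajectory for a single query:
    "correct", "wrong", or "abstain". The weighting rule is hypothetical.
    """
    n = len(outcomes)
    # Fraction of correct trajectories as a proxy for the policy's confidence.
    confidence = sum(o == "correct" for o in outcomes) / n
    # Assumed rule: +1 when the group is never correct, -1 when always correct.
    abstain_reward = 1.0 - 2.0 * confidence
    base = {"correct": 1.0, "wrong": -1.0}
    rewards = np.array([base.get(o, abstain_reward) for o in outcomes])
    # GRPO-style normalization within the group.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage: the policy almost never answers this query correctly.
adv = group_advantages(["wrong", "wrong", "wrong", "abstain"])
```

Under this assumed rule, abstaining earns a positive advantage when the group is mostly wrong and a negative one when the group is mostly correct, which is the qualitative behavior a static ternary reward cannot express.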
Abstract:In this paper, we propose a tri-domain reconfigurable multiuser multiple-input multiple-output (MIMO) communication system that integrates the electromagnetic (EM) reconfigurable antenna (EMRA) with the spatially movable antenna (SMA), termed the spatial-EM reconfigurable antenna (SEMRA). The proposed system offers EM, spatial, and digital domain degrees of freedom (DoFs) for joint channel reconfiguration, yet introduces new challenges in channel estimation (CE) and precoding optimization. Specifically, for multiuser orthogonal frequency division multiplexing (OFDM) downlink, the precoding design is formulated as a tri-domain optimization problem over antenna positions, EM-domain radiation-pattern weights, and digital precoders. We first develop a zero-forcing (ZF)-based baseline algorithm to decouple the design of spatial reconfiguration, and then propose a weighted minimum mean square error (WMMSE)-based tri-domain joint optimization algorithm for further improving the spectral efficiency (SE). Furthermore, we propose a low-overhead movement-aided channel estimation scheme in which coordinated antenna repositioning across pilot slots synthesizes a denser virtual array, enabling more accurate angle-of-departure (AoD) estimation and EM-domain channel state information (eCSI) reconstruction under the same per-user pilot overhead as the EMRA baseline. The resulting parametric representation enables eCSI assembly at desired antenna positions without additional pilots. Simulation results show that the proposed CE scheme improves eCSI estimation accuracy and the proposed SEMRA achieves higher SE than the EMRA baseline under the same pilot overhead.
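For readers unfamiliar with the zero-forcing step, the sketch below shows a textbook ZF digital precoder for a fixed channel matrix. It covers only the digital-domain part under a total power constraint; the antenna-position and radiation-pattern optimization described in the abstract is not modeled, and the function name and power normalization are assumptions.

```python
import numpy as np

def zf_precoder(H, total_power=1.0):
    """Zero-forcing precoder for a K-user, N-antenna downlink channel.

    H has shape (K, N) with K <= N. Returns W of shape (N, K), scaled so that
    the squared Frobenius norm of W equals `total_power`.
    """
    # Right pseudo-inverse of H: removes inter-user interference.
    W = H.conj().T @ np.linalg.inv(H @ H.conj().T)
    return W * np.sqrt(total_power / np.linalg.norm(W, "fro") ** 2)

# Usage: random 3-user, 6-antenna channel (synthetic data for illustration).
rng = np.random.default_rng(0)
H = rng.standard_normal((3, 6)) + 1j * rng.standard_normal((3, 6))
W = zf_precoder(H)
effective = H @ W  # diagonal up to numerical error: no inter-user leakage
```

Because the effective channel `H @ W` is a scaled identity, each user sees only its own stream, which is why ZF is a convenient baseline before a joint WMMSE-style optimization.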
Abstract:Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is \emph{Personalized Anchor Reward-Decoupled Policy Optimization} (\textbf{PARPO}), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and \emph{Preference-Aligned Skill Evolution Graph Memory} (\textbf{PSGM}) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.
Abstract:In this letter, we propose a new wireless sensing system equipped with a rotatable antenna (RA) array to enhance the sensing performance of a uniform sparse array (USA). To tackle the severe spatial undersampling issue, we propose a novel tensor decomposition-based direction-of-arrival (DOA) estimation algorithm. Specifically, we introduce a synchronous multiple rotation pattern for active target probing such that the signals received across multiple rotations capture diverse spatial degrees of freedom. Subsequently, we mathematically formulate the received signals across successive rotations as a third-order tensor, and leverage the canonical polyadic decomposition to obtain the factor matrices incorporating the DOAs of the targets. By analyzing the extrema distribution laws of the array steering vector correlation (SVC) and the gain SVC of RAs, we propose to combine the array and gain factor matrices via the Kronecker product, which theoretically guarantees unambiguous DOA estimation. Simulation results demonstrate that the proposed RA-enhanced tensor decomposition-based algorithm achieves high-precision and unambiguous sensing performance compared to conventional uniform dense arrays and omnidirectional antenna systems.
Abstract:In this paper, we investigate a multi-cell six-dimensional movable antenna (6DMA) network for enhancing downlink communication performance under inter-cell interference (ICI). Each base station (BS) is equipped with multiple 6DMA surfaces, and the 6DMA rotations affect both the desired-signal enhancement for in-cell users and the interference leakage toward neighboring cells, which makes the antenna-rotation design and transmit precoding intrinsically coupled across BSs. To address this issue, we formulate an average weighted sum-rate maximization problem for the multi-cell system by jointly optimizing the short-term downlink precoders and long-term 6DMA rotations under practical antenna geometric constraints. To tackle the resulting nonconvex problem, we propose a distributed two-timescale design based on inter-cell interference power constraint (IPC) coordination among neighboring BSs, under which each BS performs local short-term precoder optimization based on instantaneous channel state information (CSI) and long-term 6DMA rotation update according to statistical CSI with limited inter-BS information exchange. In particular, an edge-wise IPC coordination mechanism based on two-stage one-dimensional grid search and random maximal matching is developed to enable scalable distributed implementation. A centralized offline benchmark is also provided for performance comparison. Numerical results show that the proposed distributed design achieves performance close to the centralized benchmark under different interference conditions, while maintaining favorable scalability as the network size increases.
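The random maximal matching step can be illustrated with the standard greedy construction: visit the edges in random order and keep an edge whenever both endpoints are still free. This is a generic sketch, not the paper's coordination mechanism; treating BSs as graph nodes and interfering BS pairs as edges is an assumption made for illustration.

```python
import random

def random_maximal_matching(edges, seed=None):
    """Greedy maximal matching over a randomly shuffled edge list."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    matched = set()   # nodes that already belong to a chosen edge
    matching = []
    for u, v in shuffled:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

# Usage: a 5-node ring of hypothetical base stations.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
pairs = random_maximal_matching(ring, seed=0)
```

Each node appears in at most one selected edge, so the chosen pairs could update their shared constraint in parallel without conflicts, and repeating the draw with different seeds eventually visits every edge.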